MY472 Summative 3 Project
What, if any, characteristics and factors discriminate MPs who tend to ask questions about economic issues from MPs who tend to ask questions about health and welfare issues?
Link to public GitHub repository: https://github.com/lenmetson/my472-final-assignment.
Words: 722
Introduction
Parliamentary questions are a critical part of how MPs represent their constituencies. I therefore ask which characteristics of the constituency an MP represents drive their focus on economic or on health and welfare topics in 2023.
Data
I drew on data from two sources: the UK Parliament API (“API”) and the UK House of Commons constituency dashboard (“dashboard”). I limit my analysis to questions asked in 2023. To store the data efficiently, I used a local relational database.
API
First, I pulled the text of questions from the API oral and written question endpoints. For oral questions, the API only returns questions asked in the House of Commons (HoC). However, the written question endpoint returns questions from both the House of Lords (HoL) and the HoC. Therefore, I added a parameter to the request URL to only return written questions asked by members of the HoC. I merged both types of question into one table: questions.
For each question, I wanted to be able to pull in additional data about the MP who had asked it and the minister they asked it to. Some MP characteristics, such as their party affiliation or seat, change over time. The API members endpoint allows queries to specify a date, and returns the data as valid on that date. Thus, I queried the API for each unique MP-date pair in my questions table.
However, I did not want to write out a table with data on each MP for every day they had asked a question. Therefore, I grouped the clean response table by each unique combination of MP and their characteristics and summarised the earliest and latest date this combination was valid for.
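This collapsing step can be sketched on toy data (hypothetical MPs and dates, not the real API response) using base R:

```r
# Toy illustration of collapsing daily MP records into validity ranges:
# group by every characteristic except the date, then keep the earliest
# and latest date on which each combination was observed.
members_daily <- data.frame(
  member_id = c("1", "1", "1", "2"),
  party     = c("Lab", "Lab", "Con", "Lab"), # MP 1 switches party
  date      = c("2023-01-05", "2023-03-10", "2023-06-01", "2023-02-14"),
  stringsAsFactors = FALSE
)

collapse_member_ranges <- function(df) {
  # min/max compare character dates lexicographically,
  # which is correct for ISO yyyy-mm-dd strings
  agg <- aggregate(date ~ member_id + party, data = df,
                   FUN = function(d) c(min = min(d), max = max(d)))
  data.frame(member_id  = agg$member_id,
             party      = agg$party,
             valid_from = agg$date[, "min"],
             valid_to   = agg$date[, "max"],
             stringsAsFactors = FALSE)
}

collapse_member_ranges(members_daily) # 4 daily rows collapse to 3 ranges
```

The code appendix performs the same collapse with dplyr's `group_by()` and `summarize()`; the idea is identical.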
Dashboard
The data on the demographics of constituencies from the UK Parliament API is very limited. Therefore, I used the Commons Library constituency dashboard to add demographic variables. This data source does not have an API endpoint and requires each constituency to be looked up using a search tool. Therefore, I used Selenium to interactively scrape the data.
To create the constituencies table, I merged the scraped data with election results and shapefiles pulled from the API.
Measurement
To measure whether a question is about (1) economic issues or (2) health and welfare, I use a simple dictionary approach. Whilst limited compared to machine-learning classification approaches, dictionary string matching was more feasible for this project as it does not require expert labelling of a training set. I used pre-defined policy topic dictionaries from Lexicoder (Albaugh et al., 2013). I wrote out the results of my measurement to a table called question_topics.
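A minimal sketch of the dictionary approach, with illustrative keywords rather than the actual Lexicoder topic dictionaries, is:

```r
# Illustrative (not Lexicoder) keyword lists for each topic
econ_terms   <- c("economy", "inflation", "tax", "business")
health_terms <- c("nhs", "hospital", "welfare", "benefit")

# Count how many of the terms appear in the (lower-cased) question text
count_matches <- function(text, terms) {
  sum(vapply(terms, function(t) grepl(t, tolower(text), fixed = TRUE), logical(1)))
}

# Label the question with whichever topic matches more terms
classify_question <- function(text) {
  n_econ   <- count_matches(text, econ_terms)
  n_health <- count_matches(text, health_terms)
  if (n_econ > n_health) "economic"
  else if (n_health > n_econ) "health_welfare"
  else "other"
}

classify_question("What steps is the Chancellor taking to reduce inflation?") # "economic"
```

The real dictionaries contain far more entries per topic; questions matching neither dictionary fall into neither category.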
Final database
This resulted in 5 tables in my local database:
- questions
- members
- constituencies
- parties
- question_topics
Analysis
To answer which factors about an MP’s constituency discriminate whether they ask questions on economic or health & welfare topics, I construct a measure of slant at the MP level, defined as:
\[ slant = \frac{N(\text{economic})}{N(\text{questions})} - \frac{N(\text{health\_welfare})}{N(\text{questions})} \]
This measure ranges from -1 to 1. A value of -1 indicates a perfect health and welfare slant (i.e. the MP asks only health & welfare questions), and a value of 1 indicates a perfect economic slant.
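On toy counts (hypothetical MPs, not values from the data), the measure behaves as follows:

```r
# Slant: share of economic questions minus share of health & welfare questions
slant <- function(n_econ, n_health, n_total) {
  (n_econ / n_total) - (n_health / n_total)
}

slant(n_econ = 10, n_health = 0, n_total = 10) #  1: only economic questions
slant(n_econ = 0, n_health = 5, n_total = 5)   # -1: only health & welfare questions
slant(n_econ = 3, n_health = 3, n_total = 12)  #  0: no slant
```

Note that n_total can exceed n_econ + n_health, since some questions match neither dictionary.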
Figure 1 shows that most MPs have no slant and that there is a slight skew towards health & welfare questions.
Figure 2 shows the geographic distribution of economic slant across the UK. Whilst there is no clear regional pattern, small clusters of urban areas, such as the North of Wales and the North West of England, appear slightly more focused on health & welfare.
To explore this further, Figure 3 shows that population density is correlated with a stronger focus on health & welfare relative to economic issues. This suggests that how urban the constituency an MP represents is partly drives which topics they focus on in their questions.
Figure 4 shows the correlation between slant and the number of households claiming universal credit (UC) in the constituency. Surprisingly, there is no overall relationship between the density of UC claimants and a focus on health and welfare issues.
Finally, Figure 6 shows that the slant of an MP’s questions is uncorrelated with the marginality of their seat.
Overall, I find that the characteristics of an MP’s constituency do not particularly discriminate whether they focus on economic or health & welfare issues.
References
Albaugh, Quinn, Julie Sevenans and Stuart Soroka. 2013. Lexicoder Topic Dictionaries, June 2013 versions, McGill University, Montreal, Canada.
Code appendix
knitr::opts_chunk$set(
echo = FALSE,
eval = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE,
fig.width=7, fig.height=6,
fig.align='center')
# Can also be run by sourcing scripts/00_setup.R
# Define function to install or load packages
load_packages <- function(x) {
y <- x %in% rownames(installed.packages())
if(any(!y)) install.packages(x[!y])
invisible(lapply(x, library, character.only = TRUE))
rm(x, y)
}
# Load required packages
load_packages(c(
"tidyverse",
"here",
# Database management
"DBI",
"RSQLite",
# APIs and webscraping
"httr",
"RSelenium",
# Text analysis
"tm",
"XML",
# Geospatial plots
"tmap",
"sf",
"ggrepel",
# Random forests
"parallel",
"ranger",
"tidymodels",
"vip",
"rpart",
"rpart.plot",
# Plotting
"gridExtra"
))
replace_null_with_na <- function(x) {
if (is.list(x)) { # Checks for whether the item is a sublist
lapply(x, replace_null_with_na) # if it is, apply the function for each of the elements within the sublist
} else { # If it isn't, simply apply the main function
ifelse(is.null(x) || x == "null", "NA", x)
}
}
replace_na_chr <- function(df) { # NOTE function adapted from ChatGPT output
df <- df %>%
mutate(across(where(is.character), ~na_if(., "NA")))
return(df)
}
db_table_check <- function(table){
rows <- dbGetQuery(db, paste0("SELECT COUNT(1) FROM ", table))
cols <- dbListFields(db, table)
result = list(
table = table,
n_rows = rows[[1]],
col_names = cols)
return(result)
}
db <- DBI::dbConnect(RSQLite::SQLite(), here("data/parliament_database.sqlite"))
# This code can also be run by sourcing scripts/01_pull-oral-questions.R
GET_qs <- function(endpoint_url, n_skip = 0) {
url <- paste0(
endpoint_url,
"?parameters.skip=",
n_skip,
"&parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
"&parameters.take=100")
response <-
httr::GET(url) %>%
httr::content("parsed") # Use :: because tm masks content
return(response)
}
# Define functions to pull all questions
pull_all_oral_qs <- function(endpoint_url){
# Calculate how many questions are in the end point
n_resp <- httr::GET(paste0(
endpoint_url,
"?parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
"&parameters.take=1")) %>%
httr::content("parsed")
n <- n_resp$PagingInfo$GlobalTotal
# Questions can be pulled in batches of 100,
# calculate how many times we will have to pull
n_loops <- ceiling(n / 100)
print(paste0("LOG | ", Sys.time(), " | Oral question pull starting"))
for (i in 1:n_loops) {
n_skip <- (i - 1) * 100 # Skip however many 100s the loop has run
if (i == 1) { # On first iteration, make new list
response <- GET_qs(endpoint_url, n_skip)
response <- response$Response
} else { # On all other iterations, append to existing list
response_new <- GET_qs(endpoint_url, n_skip)
response_new <- response_new$Response
response <- c(response, response_new) # Merge responses
}
print(paste0("LOG | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message
Sys.sleep(1) # Sleep to avoid hammering the API
}
print(paste0("LOG | ", Sys.time(), " | Oral question pull done :)"))
return(response)
}
## APPLY FUNCTIONS
oral_questions <- pull_all_oral_qs(
"https://oralquestionsandmotions-api.parliament.uk/oralquestions/list")
saveRDS(oral_questions, "data/oral_questions_2023.RDS")
# This code can also be run by sourcing scripts/02_pull-written-questions.R
GET_qs_written <- function(endpoint_url, n_skip = 0) {
url <- paste0(
endpoint_url,
"?skip=",
n_skip,
"&tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
"&house=Commons", # Limit to HoC
"&take=100")
response <-
httr::GET(url) %>%
httr::content("parsed") # Use :: because tm masks content
return(response)
}
pull_all_written_qs <- function(endpoint_url){
n_resp <- httr::GET(
paste0(
endpoint_url,
"?tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
"&house=Commons", # Limit to HoC
"&take=1")) %>%
httr::content("parsed")
n <- n_resp$totalResults
# Questions can be pulled in batches of 100, calculate how many times we will have to pull
n_loops <- ceiling(n/100)
for(i in 1:n_loops){
n_skip <- (i-1)*100 # Skip however many 100s the loop has run
if(i==1){ # On first iteration, make new list
response <- GET_qs_written(endpoint_url, n_skip)
response <- response$results
} else { # On all other iterations, append to existing list
responseNew <- GET_qs_written(endpoint_url, n_skip)
responseNew <- responseNew$results
response <- c(response, responseNew) # Merge responses
}
print(paste0("LOG | Written questions | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message
Sys.sleep(0.5) # Sleep to avoid hammering the API
}
print(paste0("LOG | Written questions | ", Sys.time(), " | Written question pull done :)"))
return(response)
}
written_questions <- pull_all_written_qs("https://questions-statements-api.parliament.uk/api/writtenquestions/questions")
saveRDS(written_questions, "data/written_questions_2023.RDS")
oral_questions <- readRDS("data/oral_questions_2023.RDS")
for (i in seq_along(oral_questions)) {
# remove sublists, otherwise names do not match
oral_questions[[i]]$AskingMember <- NULL
oral_questions[[i]][["AnsweringMinister"]] <- NULL
oral_questions[[i]] <- replace_null_with_na(oral_questions[[i]])
if (i == 1){
oral_question_df <- data.frame(oral_questions[[i]])
} else {
oral_question_df2 <- data.frame(oral_questions[[i]])
oral_question_df <- rbind(oral_question_df, oral_question_df2)
}
}
rm(oral_question_df2, i)
### Clean dataframes and merge into one table ####
oral_question_df <- oral_question_df %>%
select(
question_id = Id,
question_text = QuestionText,
asking_member = AskingMemberId,
question_tabled_when = TabledWhen,
question_answering_when = AnsweringWhen,
question_answering_body = AnsweringBody,
question_answering_body_id = AnsweringBodyId,
answering_member = AnsweringMinisterId) %>%
# Ensure variables are the correct class
mutate(
question_id = as.character(question_id),
question_text = as.character(question_text),
asking_member = as.character(asking_member),
question_tabled_when = as.character(as.Date(question_tabled_when)),
question_answering_when = as.character(as.Date(question_answering_when)),
question_answering_body = as.character(question_answering_body),
question_answering_body_id = as.character(question_answering_body_id),
answering_member = as.character(answering_member),
oral_written = "oral") # add written_oral dummy
written_questions <- readRDS("data/written_questions_2023.RDS")
for (i in seq_along(written_questions)) {
# Remove links sublist by keeping only "value"
written_questions[[i]] <- written_questions[[i]]$value
# remove sublists, otherwise names do not match
written_questions[[i]]$groupedQuestions <- NULL
written_questions[[i]]$attachments <- NULL
written_questions[[i]]$groupedQuestionsDates <- NULL
# Replace nulls with NAs
written_questions[[i]] <- replace_null_with_na(written_questions[[i]])
if (i == 1){
written_question_df <- data.frame(written_questions[[i]])
} else {
written_question_df2 <- data.frame(written_questions[[i]])
written_question_df <- rbind(written_question_df, written_question_df2)
}
print(paste0(i, " of ", length(written_questions)))
}
rm(written_question_df2, i)
written_question_df <- written_question_df %>%
select(
question_id = id,
question_text = questionText,
asking_member = askingMemberId,
question_tabled_when = dateTabled,
question_answering_when = dateForAnswer,
question_answering_body = answeringBodyName,
question_answering_body_id = answeringBodyId,
answering_member = answeringMemberId) %>%
# Ensure variables are the correct class
mutate(
question_id = as.character(question_id),
question_text = as.character(question_text),
asking_member = as.character(asking_member),
question_tabled_when = as.character(as.Date(question_tabled_when)),
question_answering_when = as.character(as.Date(question_answering_when)),
question_answering_body = as.character(question_answering_body),
question_answering_body_id = as.character(question_answering_body_id),
answering_member = as.character(answering_member),
oral_written = "written") # add written_oral dummy
question_df <- rbind(oral_question_df, written_question_df)
dbWriteTable(db, "questions", question_df, overwrite = TRUE)
rm(oral_question_df, written_question_df, question_df)
# This code can also be run by sourcing scripts/03_pull-members-endpoint.R
pull_members <- function(base_url, df) {
for (i in seq_along(df$member_id)) {
url <- paste0( # Build request URL
base_url, "/",
df$member_id[i],
"?detailsForDate=",
df$question_tabled_when[i])
if (i == 1) { # If 1st iteration, create response,
response <- httr::GET(url) %>% httr::content("parsed") # Pull request
response <- response[1] # Extract list with response
response <- c(
date = df$question_tabled_when[i], response[[1]]) # Merge with date
response <- list(response) # Convert to list
} else { # else create response2, then merge
response_new <- httr::GET(url) %>% httr::content("parsed")
response_new <- response_new[1]
response_new <- c(
date = df$question_tabled_when[i], response_new[[1]])
response_new <- list(response_new)
response <- c(response, response_new) # Merge responses
}
Sys.sleep(1)
print(paste0("LOG Member Pull | ", Sys.time(), " | ", i, " of ", nrow(df), " done"))
}
return(response)
}
# Query question table to get MP-date pairs
members_asking <- dbGetQuery(db,
"
SELECT
asking_member AS member_id,
question_tabled_when
FROM questions
") %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
ministers_answering <- dbGetQuery(db,
"
SELECT
answering_member AS member_id,
question_tabled_when
FROM questions
") %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
q_parameters <- rbind(members_asking, ministers_answering)
# Only keep unique MP-date pairs to avoid pulling the same information twice
q_parameters <- unique(q_parameters) %>%
filter(member_id != 0) # Remove 0s because these indicate no minister has answered
# Apply function to pull members
members <- pull_members(
"https://members-api.parliament.uk/api/Members",
q_parameters)
saveRDS(members, "data/members_raw.Rds")
members <- readRDS("data/members_raw.Rds")
# Replace "null" values with NA so they are kept in the structure of the list
members <- lapply(
members,
function(x) {lapply(x, replace_null_with_na)})
# Convert to dataframe
for (i in seq_along(members)) {
if (i == 1) {
members_df <- members[i] %>%
unlist() %>%
t() %>%
data.frame()
} else {
members_df_new <- members[i] %>%
unlist() %>%
t() %>%
data.frame()
members_df <- rbind(members_df, members_df_new)
}
}
members_df <- members_df %>%
select(
member_date_valid = date,
member_id = id,
name_display = nameDisplayAs,
gender = gender,
latest_constituency = latestHouseMembership.membershipFromId,
latest_party_id = latestParty.id
)
# Some MP characteristics change over time, so we collected unique MP-day queries.
# However, characteristics do not change daily, so there is a lot of repetition.
# The following code groups MPs by the mutable variables (i.e. each unique
# combination of characteristics), then summarises the earliest and latest
# date that combination was valid.
# Before this function there are 4225 observations, and after, only 482.
members_df_grouped <- members_df %>%
group_by( # Group by all variables apart from date
member_id,
name_display,
gender,
latest_constituency,
latest_party_id
) %>%
summarize( # Summarise the earliest and latest date this combination is valid for, giving the range of dates over which it would otherwise be duplicated
member_date_valid_min = min(member_date_valid),
member_date_valid_max = max(member_date_valid)
) %>%
mutate(
member_id = as.character(member_id),
name_display = as.character(name_display),
gender = as.character(gender),
latest_constituency = as.character(latest_constituency),
latest_party_id = as.character(latest_party_id),
member_date_valid_min = as.character(member_date_valid_min),
member_date_valid_max = as.character(member_date_valid_max)
)
#unique(members_df$member_id) %>% length() # This returns 474, indicating there are changes
dbWriteTable(db, "members", members_df_grouped, overwrite = TRUE)
# This code can also be run by sourcing scripts/04_pull-constituency-endpoints.R
MPs <- dbGetQuery(db,
"
SELECT *
FROM members
") %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
constituencies <- MPs$latest_constituency %>%
unique()
constituencies <-
data.frame(
constituency_id = constituencies
) %>%
mutate(
cons_name = NA,
cons_start_date = NA,
cons_end_date = NA,
last_election_1_electorate = NA,
last_election_1_turnout = NA,
last_election_1_majority = NA,
last_election_1_result = NA,
last_election_1_winning_party = NA,
last_election_1_election_ID = NA,
last_election_1_electionDate = NA,
last_election_1_isGeneralElection = NA,
last_election_2_electorate = NA,
last_election_2_turnout = NA,
last_election_2_majority = NA,
last_election_2_result = NA,
last_election_2_winning_party = NA,
last_election_2_election_ID = NA,
last_election_2_electionDate = NA,
last_election_2_isGeneralElection = NA,
last_election_3_electorate = NA,
last_election_3_turnout = NA,
last_election_3_majority = NA,
last_election_3_result = NA,
last_election_3_winning_party = NA,
last_election_3_election_ID = NA,
last_election_3_electionDate = NA,
last_election_3_isGeneralElection = NA,
last_election_4_electorate = NA,
last_election_4_turnout = NA,
last_election_4_majority = NA,
last_election_4_result = NA,
last_election_4_winning_party = NA,
last_election_4_election_ID = NA,
last_election_4_electionDate = NA,
last_election_4_isGeneralElection = NA,
shapefile = NA
)
### Pull basic details
pull_const_info <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id)
basic_info <- httr::GET(url) %>%
httr::content("parsed")
return(basic_info)
}
for(i in seq_along(constituencies$constituency_id)) {
response <- pull_const_info(constituencies$constituency_id[i])
response <- response[[1]]
constituencies$cons_name[i] <- response$name
constituencies$cons_start_date[i] <- response$startDate
constituencies$cons_end_date[i] <- ifelse(is.null(response$endDate), NA, response$endDate)
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - basic | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
### Pull shape file
get_cons_shapefile <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id,
"/Geometry")
shapefile <- httr::GET(url) %>%
httr::content("parsed")
return(shapefile)
}
for(i in seq_along(constituencies$constituency_id)) {
response <- get_cons_shapefile(constituencies$constituency_id[i])
response <- response[[1]]
constituencies$shapefile[i] <- response
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - shapefile | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
### Pull election results
get_cons_election_results <- function(cons_id) {
url <- paste0(
"https://members-api.parliament.uk/api/Location/Constituency/",
cons_id,
"/ElectionResults")
results <- httr::GET(url) %>%
httr::content("parsed")
return(results)
}
for (i in seq_along(constituencies$constituency_id)) {
response <- get_cons_election_results(constituencies$constituency_id[i])
response <- response[[1]]
response <- lapply(response, function(lst) {lapply(lst, replace_null_with_na)})
constituencies$last_election_1_electorate[i] <- response[[1]]$electorate
constituencies$last_election_1_turnout[i] <- response[[1]]$turnout
constituencies$last_election_1_majority[i] <- response[[1]]$majority
constituencies$last_election_1_result[i] <- response[[1]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[1]]$winningParty) > 1) { # When there is content in the winning party sublist, the length will be greater than 1
constituencies$last_election_1_winning_party[i] <- response[[1]]$winningParty$id
} else {
constituencies$last_election_1_winning_party[i] <- NA
}
constituencies$last_election_1_election_ID[i] = response[[1]]$electionId
constituencies$last_election_1_electionDate[i] = response[[1]]$electionDate
constituencies$last_election_1_isGeneralElection[i] = response[[1]]$isGeneralElection
constituencies$last_election_2_electorate[i] <- response[[2]]$electorate
constituencies$last_election_2_turnout[i] <- response[[2]]$turnout
constituencies$last_election_2_majority[i] <- response[[2]]$majority
constituencies$last_election_2_result[i] <- response[[2]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[2]]$winningParty) > 1) {
constituencies$last_election_2_winning_party[i] <- response[[2]]$winningParty$id
} else {
constituencies$last_election_2_winning_party[i] <- NA
}
constituencies$last_election_2_election_ID[i] = response[[2]]$electionId
constituencies$last_election_2_electionDate[i] = response[[2]]$electionDate
constituencies$last_election_2_isGeneralElection[i] = response[[2]]$isGeneralElection
constituencies$last_election_3_electorate[i] <- response[[3]]$electorate
constituencies$last_election_3_turnout[i] <- response[[3]]$turnout
constituencies$last_election_3_majority[i] <- response[[3]]$majority
constituencies$last_election_3_result[i] <- response[[3]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[3]]$winningParty) > 1) {
constituencies$last_election_3_winning_party[i] <- response[[3]]$winningParty$id
} else {
constituencies$last_election_3_winning_party[i] <- NA
}
constituencies$last_election_3_election_ID[i] = response[[3]]$electionId
constituencies$last_election_3_electionDate[i] = response[[3]]$electionDate
constituencies$last_election_3_isGeneralElection[i] = response[[3]]$isGeneralElection
constituencies$last_election_4_electorate[i] <- response[[4]]$electorate
constituencies$last_election_4_turnout[i] <- response[[4]]$turnout
constituencies$last_election_4_majority[i] <- response[[4]]$majority
constituencies$last_election_4_result[i] <- response[[4]]$result
# If no winner recorded, skip this and assign NA
if(length(response[[4]]$winningParty) > 1) {
constituencies$last_election_4_winning_party[i] <- response[[4]]$winningParty$id
} else {
constituencies$last_election_4_winning_party[i] <- NA
}
constituencies$last_election_4_election_ID[i] = response[[4]]$electionId
constituencies$last_election_4_electionDate[i] = response[[4]]$electionDate
constituencies$last_election_4_isGeneralElection[i] = response[[4]]$isGeneralElection
Sys.sleep(0.5)
print(paste0("LOG | Constituency API call - elections | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}
saveRDS(constituencies, "data/constituencies_api_raw.Rds")
# This code can also be run by sourcing scripts/04_selinium-scrape-HoC-dashboard.R
# NOTE cons_hoc returns 610 not 472 because it was pulled based on constituencies in all oral questions, not just 2023
# Read in data from the constituency endpoint pull
cons <- readRDS("data/constituencies_api_raw.Rds")
# Make new dataframe
cons <- cons %>%
select(constituency_id, cons_name) %>%
unique() %>% # Keep only unique constituencies
mutate( # Initialise variables
region_nation_hoclib23 = NA,
population_hoclib23 = NA,
area_hoclib23 = NA,
age_0_29_hoclib23 = NA,
age_30_64_hoclib23 = NA,
age_65_plus_hoclib23 = NA,
uc_claimants_hoclib23 = NA,
median_house_price_hoclib23 = NA
)
# Check whether constituencies have already been pulled and saved. If they have, filter out these so they are not re-scraped.
# If running for the first time, you will not be able to read in cons_hoc, so the filtering is skipped.
if (file.exists("data/hoc_library_scrape_raw.Rds")) {
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
cons$check_already_pulled <- cons$cons_name %in% cons_hoc$cons_name
cons <- cons %>% filter(check_already_pulled == FALSE)
}
# Set up the Selenium browser
rD <- rsDriver(browser=c("firefox"), verbose = F, port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD$client
# Define a list of XPath selectors
# The dashboard is contained within an "iframe".
# This embeds a separate html tree within the main html of the webpage, meaning selector paths do not point at elements of the outer page.
# We therefore need to identify the iframe and use `switchToFrame()` before locating elements on the dashboard.
selector_list <- list()
selector_list$search_dropdown <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[1]/transform/div/div[3]/div/div/visual-modern/div/div/div[2]/div/i"
selector_list$search_box <- "/html/body/div[7]/div[1]/div/div[1]/input"
selector_list$search_result <- "/html/body/div[7]/div[1]/div/div[2]/div/div[1]/div/div/div[1]/div/span"
selector_list$region_nation <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[2]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div"
selector_list$population <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[3]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"
selector_list$area <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[5]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"
selector_list$age_0_29 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[11]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$age_30_64 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[13]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$age_65_plus <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[15]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"
selector_list$uc_claimants <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[28]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[1]/div[1]"
selector_list$house_prices <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[39]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[2]/div[1]"
constituency_dash_scraper <- function(
constituency_name,
wait_base = 1 # Allows the user to adjust wait lengths (e.g. if running on a slow connection)
# If you get a 'could not find element' error, try increasing the wait time, as the dashboard takes a while to load
){
# Find dropdown box and click on it
search_dropdown <- driver$findElement(using = "xpath", value = selector_list$search_dropdown)
search_dropdown$clickElement()
# Find search box and type constituency name
Sys.sleep(wait_base * 2)
search_box <- driver$findElement(using = "xpath", value = selector_list$search_box)
#search_box$clickElement() # Do not strictly need this, but if not working try uncommenting
search_box$clearElement()
search_box$sendKeysToElement(list(constituency_name))
Sys.sleep(wait_base * 4) # This requires a long time to load.
# Click on the first result to load data
first_result <- driver$findElement(using = "xpath", value = selector_list$search_result)
first_result$clickElement()
Sys.sleep(wait_base * 4) # Wait for data to load
# EXTRACT TEXT FROM ELEMENTS
# Set defaults as NA
region_nation_text <- NA
population_text <- NA
area_text <- NA
age_0_29_text <- NA
age_30_64_text <- NA
age_65_plus_text <- NA
uc_claimants_text <- NA
house_prices_text <- NA
# Region or nation
tryCatch({ # Prevent loop from closing if no data available
suppressMessages({
region_nation <- driver$findElement(using = "xpath", value = selector_list$region_nation)
region_nation_text <- region_nation$getElementText()[[1]]
})
}, error = function(e) {
# Print error message, no need to assign NA as we have set NA as default
print(paste0("Log: NA assigned for REGION/NATION at iteration ", i))
})
# Population
tryCatch({
suppressMessages({
population <- driver$findElement(using = "xpath", value = selector_list$population)
population_text <- population$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for POPULATION at iteration ", i))
})
# Area in sq km
tryCatch({
suppressMessages({
area <- driver$findElement(using = "xpath", value = selector_list$area)
area_text <- area$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AREA at iteration ", i))
})
# Age composition
tryCatch({
suppressMessages({
age_0_29 <- driver$findElement(using = "xpath", value = selector_list$age_0_29)
age_0_29_text <- age_0_29$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 0-29 at iteration ", i))
})
tryCatch({
suppressMessages({
age_30_64 <- driver$findElement(using = "xpath", value = selector_list$age_30_64)
age_30_64_text <- age_30_64$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 30-64 PLUS at iteration ", i))
})
tryCatch({
suppressMessages({
age_65_plus <- driver$findElement(using = "xpath", value = selector_list$age_65_plus)
age_65_plus_text <- age_65_plus$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for AGE 65 PLUS at iteration ", i))
})
# Universal credit claimants
tryCatch({
suppressMessages({
uc_claimants <- driver$findElement(using = "xpath", value = selector_list$uc_claimants)
uc_claimants_text <- uc_claimants$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for UC CLAIMANTS at iteration ", i))
})
# House price
tryCatch({
suppressMessages({
house_prices <- driver$findElement(using = "xpath", value = selector_list$house_prices)
house_prices_text <- house_prices$getElementText()[[1]]
})
}, error = function(e) {
print(paste0("Log: NA assigned for HOUSE PRICE at iteration ", i))
})
# Merge results into a list
results = list(
region_nation_text,
population_text, area_text,
age_0_29_text, age_30_64_text, age_65_plus_text,
uc_claimants_text, house_prices_text)
return(results)
}
# Run the scraper
# Navigate to home page outside of the loop to avoid reloading each time
driver$navigate("https://commonslibrary.parliament.uk/constituency-dashboard/")
Sys.sleep(1)
# The dashboard sits inside an iframe; unless we "switch" to this subframe, the XPath selectors will not resolve
# Identify and switch to the iframe
iframe <- driver$findElement(using = "xpath", value = "//iframe[@title='Constituency dashboard']")
driver$switchToFrame(iframe)
Sys.sleep(4)
# Set the number to start from in case the loop is interrupted but we have cached results
start_from <- 1
for (i in start_from:length(cons$constituency_id)) {
results <- constituency_dash_scraper(cons$cons_name[i], wait_base = 1)
cons$region_nation_hoclib23[i] <- results[[1]]
cons$population_hoclib23[i] <- results[[2]]
cons$area_hoclib23[i] <- results[[3]]
cons$age_0_29_hoclib23[i] <- results[[4]]
cons$age_30_64_hoclib23[i] <- results[[5]]
cons$age_65_plus_hoclib23[i] <- results[[6]]
cons$uc_claimants_hoclib23[i] <- results[[7]]
cons$median_house_price_hoclib23[i] <- results[[8]]
# Cache results collected so far
saveRDS(cons, paste0("data/cache_cons_at", i, ".Rds"))
if (i > start_from) {
  file.remove(paste0("data/cache_cons_at", i - 1, ".Rds")) # delete previous cached object
}
Sys.sleep(1)
print(paste0(i, " of ", nrow(cons), " done."))
}
# Kill driver and java processes
driver$close()
rD$server$stop()
if (.Platform$OS.type == "windows") {
  system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
} else {
  system("pkill -f java") # equivalent on Linux/macOS
}
if (file.exists("data/hoc_library_scrape_raw.Rds")) {
cons$check_already_pulled <- NULL
saveRDS(cons, "data/hoc_library_scrape_raw_extra.Rds")
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
cons <- rbind(cons_hoc, cons)
saveRDS(cons, "data/hoc_library_scrape_raw.Rds")
} else {
# Save output
saveRDS(cons, "data/hoc_library_scrape_raw.Rds")
}
# Clean dashboard data
cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
# pop numeric
cons_hoc$population_hoclib23 <- cons_hoc$population_hoclib23 %>%
str_remove_all(",") %>%
as.numeric()
# area numeric
cons_hoc$area_hoclib23 <- cons_hoc$area_hoclib23 %>%
str_extract(".*(?=\\s*sq\\.\\s*km)") %>%
str_remove_all(",") %>%
as.numeric()
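As a quick check of the lookahead regex used for the area field, a toy string in the dashboard's "N,NNN sq. km" format (hypothetical value, not real data) cleans the same way:

```r
library(stringr)

x <- "1,234 sq. km" # hypothetical dashboard value
# Keep everything before "sq. km", drop thousands separators, coerce to numeric
area <- as.numeric(str_remove_all(str_extract(x, ".*(?=\\s*sq\\.\\s*km)"), ","))
area # 1234
```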
# age perc: strip "%" and convert to proportion
cons_hoc <- cons_hoc %>%
  mutate(across(
    c(age_0_29_hoclib23, age_30_64_hoclib23, age_65_plus_hoclib23),
    ~ as.numeric(str_remove_all(.x, "%")) / 100
  ))
# uc numeric
cons_hoc$uc_claimants_hoclib23 <- cons_hoc$uc_claimants_hoclib23 %>%
str_remove_all(",") %>%
as.numeric()
# house price numeric
cons_hoc$median_house_price_hoclib23 <- cons_hoc$median_house_price_hoclib23 %>%
str_remove_all(",|£") %>%
as.numeric()
# Merge API and dashboard data
cons_api <- readRDS("data/constituencies_api_raw.Rds")
cons_hoc <- cons_hoc %>%
select(-cons_name)
cons <- left_join(cons_api, cons_hoc, by = "constituency_id")
# Write out to database
dbWriteTable(db, "constituencies", cons, overwrite = TRUE)
response <- httr::GET("https://members-api.parliament.uk/api/Parties/GetActive/1") %>%
httr::content("parsed")
parties <- response$items
parties <- replace_null_with_na(parties)
# Build one row per party; map_dfr avoids growing a data frame inside a loop
parties_df <- purrr::map_dfr(parties, function(party) {
  data.frame(
    party_id = party$value$id,
    party_name = party$value$name,
    party_abbreviation = party$value$abbreviation,
    party_colour = party$value$backgroundColour
  )
})
parties_df <- parties_df %>%
mutate(
party_id = as.character(party_id),
party_name = as.character(party_name),
party_abbreviation = as.character(party_abbreviation),
party_colour = as.character(party_colour))
dbWriteTable(db, "parties", parties_df, overwrite = TRUE)
question_text <- dbGetQuery(db,
"
SELECT question_id, question_text
FROM questions
"
) %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
# Measure
# Initialise variables as NA
question_text$is_econ <- NA
question_text$is_health_welf <- NA
# clean question text
question_text$question_text <- question_text$question_text %>%
tolower() %>% # Convert to lower case
tm::removePunctuation() # remove punctuation
# Define dictionaries
# NOTE Citation: Albaugh, Quinn, Julie Sevenans and Stuart Soroka. 2013. Lexicoder Topic Dictionaries, June 2013 versions, McGill University, Montreal, Canada.
# Download Lexicon Policy Topic Dictionaries
if (!file.exists("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd")) {
dir.create("data/lexicoder_dictionaries")
download.file(
"https://www.snsoroka.com/s/LTDjun2013.zip",
"data/lexicoder_dictionaries/policy_topics.zip")
unzip(
"data/lexicoder_dictionaries/policy_topics.zip",
exdir = "data/lexicoder_dictionaries", overwrite = TRUE)
}
# NOTE ChatGPT used to write code that parses XML
parsed_string <- readLines("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd") %>%
paste(collapse = "\n") %>%
xmlTreeParse(useInternalNodes = TRUE)
extract_pnodes <- function(cnode) {
sapply(xpathApply(cnode, "./pnode"), function(pnode) {
xmlAttrs(pnode)[["name"]]
})
}
# Extract cnodes and their corresponding pnodes
dictionaries_output <- xpathApply(parsed_string, "//cnode", function(cnode) {
cnode_name <- xmlAttrs(cnode)[["name"]]
pnodes <- extract_pnodes(cnode)
return(list(cnode_name = cnode_name, pnodes = pnodes))
})
# Convert to a single list
dictionaries <- list()
for (item in dictionaries_output) {
cnode_name <- item$cnode_name
pnodes <- item$pnodes
dictionaries[[cnode_name]] <- pnodes
}
rm(cnode_name, dictionaries_output, item, parsed_string, pnodes)
econ_dict <- c( # Select relevant dictionaries
dictionaries$macroeconomics,
dictionaries$finance,
dictionaries$foreign_trade
)
# Convert to regex string and convert to lower for matching
econ_dict <- econ_dict %>%
paste(collapse="|") %>%
tolower()
health_welf_dict <- c( # Select relevant dictionaries
dictionaries$healthcare,
dictionaries$social_welfare
)
health_welf_dict <- health_welf_dict %>%
paste(collapse="|") %>%
tolower()
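As a sanity check of the dictionary-to-regex approach, a toy alternation (hypothetical terms, not the Lexicoder entries) behaves as follows with str_detect:

```r
library(stringr)

# Hypothetical toy dictionary collapsed into an alternation pattern
toy_dict <- paste(c("inflation", "interest rate"), collapse = "|") # "inflation|interest rate"
str_detect("what assessment has been made of inflation", toy_dict) # TRUE
str_detect("a question about school funding", toy_dict)            # FALSE
# Caveat: without word boundaries, terms also match inside longer words
str_detect("counter-inflationary policy", toy_dict)                # TRUE
```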
question_text <- question_text %>%
mutate(
is_econ = NA,
is_health_welf = NA
) %>%
mutate(
is_econ =
ifelse(
str_detect(question_text, econ_dict), 1, 0),
is_health_welf =
ifelse(
str_detect(question_text, health_welf_dict), 1, 0)
)
mean(question_text$is_econ) # Share of questions flagged as economic
mean(question_text$is_health_welf) # Share of questions flagged as health/welfare
dbWriteTable(db, "question_topics", question_text, overwrite = TRUE)
dbListTables(db)
db_table_check("questions")
db_table_check("members")
db_table_check("constituencies")
db_table_check("parties")
db_table_check("question_topics")
analysis_df <- dbGetQuery(
db,
"
SELECT
members.name_display AS MP,
parties.party_abbreviation AS party_abbreviation,
SUM(question_topics.is_econ) * 1.0 / COUNT(*) AS econ_prop, /* * 1.0 guards against SQLite integer division */
SUM(question_topics.is_health_welf) * 1.0 / COUNT(*) AS health_welf_prop,
constituencies.uc_claimants_hoclib23 AS uc_claimants,
constituencies.median_house_price_hoclib23 AS median_house_price,
constituencies.population_hoclib23 / constituencies.area_hoclib23 AS density,
constituencies.age_0_29_hoclib23 AS age_29,
constituencies.age_30_64_hoclib23 AS age_30_64,
constituencies.age_65_plus_hoclib23 AS age_65,
/* Majority */
constituencies.last_election_1_majority,
constituencies.last_election_2_majority,
constituencies.last_election_3_majority,
constituencies.last_election_4_majority,
constituencies.last_election_1_electorate,
constituencies.last_election_2_electorate,
constituencies.last_election_3_electorate,
constituencies.last_election_4_electorate,
/* results */
constituencies.last_election_1_result,
constituencies.last_election_2_result,
constituencies.last_election_3_result,
constituencies.last_election_4_result
FROM questions
JOIN question_topics ON questions.question_id = question_topics.question_id
LEFT JOIN members ON questions.asking_member = members.member_id
/* this has to be joined before anything */
/* from members to avoid dropping rows */
/* select row where date of question comes between the dates valid range */
AND REPLACE(questions.question_tabled_when, '-', '')
/* no date class in SQLite, so convert to string*/
BETWEEN REPLACE(members.member_date_valid_min, '-', '')
AND REPLACE(members.member_date_valid_max, '-', '')
LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id
LEFT JOIN parties ON parties.party_id = members.latest_party_id
GROUP BY members.member_id
"
) %>%
replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
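A minimal base-R sketch of the string-comparison trick in the SQL above: with hyphens removed, ISO yyyymmdd date strings sort chronologically, so BETWEEN works on them as plain text:

```r
# Mirror of REPLACE(date, '-', '') in the SQL: strip hyphens, compare as strings
question_date <- gsub("-", "", "2023-06-15") # "20230615"
valid_min <- gsub("-", "", "2023-01-01")
valid_max <- gsub("-", "", "2023-12-31")
question_date >= valid_min && question_date <= valid_max # TRUE
```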
# Convert majority variables into +/- depending on whether current MP won or lost
analysis_df <- analysis_df %>%
mutate(
last_election_1_majority =
ifelse(str_detect(last_election_1_result, party_abbreviation),
last_election_1_majority,
last_election_1_majority * -1),
last_election_2_majority =
ifelse(str_detect(last_election_2_result, party_abbreviation),
last_election_2_majority,
last_election_2_majority * -1),
last_election_3_majority =
ifelse(str_detect(last_election_3_result, party_abbreviation),
last_election_3_majority,
last_election_3_majority * -1),
last_election_4_majority =
ifelse(str_detect(last_election_4_result, party_abbreviation),
last_election_4_majority,
last_election_4_majority * -1)
) %>%
select(-c(last_election_1_result, last_election_2_result, last_election_3_result, last_election_4_result))
# Calculate marginality
analysis_df <- analysis_df %>%
mutate(
marginality_1 = (last_election_1_majority / last_election_1_electorate),
marginality_2 = (last_election_2_majority / last_election_2_electorate),
marginality_3 = (last_election_3_majority / last_election_3_electorate),
marginality_4 = (last_election_4_majority / last_election_4_electorate)
) %>%
  select(-c(last_election_1_majority, last_election_2_majority,
            last_election_3_majority, last_election_4_majority,
            last_election_1_electorate, last_election_2_electorate,
            last_election_3_electorate, last_election_4_electorate))
# Calculate mean marginality
analysis_df <- analysis_df %>%
mutate(
mean_marginality = rowMeans(select(., starts_with("marginality_")))
) %>%
select(-c(marginality_1, marginality_2, marginality_3, marginality_4))
analysis_df <- analysis_df %>%
mutate(econ_slant = econ_prop - health_welf_prop)
hist <- analysis_df %>%
ggplot() +
geom_vline(xintercept = 0)+
geom_histogram(aes(x=econ_slant), alpha = 0.7) +
xlim(-1,1)+
labs(
title = "Distribution of oral and written question slant",
x = "Slant", y = "Count") +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
hist
geog_data <- dbGetQuery(db,
"
SELECT
constituencies.cons_name AS constituency,
SUM(question_topics.is_econ) * 1.0 / COUNT(*) AS econ_prop, /* * 1.0 guards against SQLite integer division */
SUM(question_topics.is_health_welf) * 1.0 / COUNT(*) AS health_welf_prop,
constituencies.shapefile AS con_shapefile
FROM questions
LEFT JOIN members ON questions.asking_member = members.member_id
/* this has to be joined before anything */
/* from members to avoid dropping rows */
/* select row where date of question comes between the dates valid range */
AND REPLACE(questions.question_tabled_when, '-', '')
/* no date class in SQLite, so convert to string*/
BETWEEN REPLACE(members.member_date_valid_min, '-', '')
AND REPLACE(members.member_date_valid_max, '-', '')
LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id
LEFT JOIN question_topics ON questions.question_id = question_topics.question_id
GROUP BY constituencies.cons_name
"
) %>%
replace_na_chr() %>% # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.
filter(!is.na(con_shapefile)) # Drop constituencies without shapefiles
geog_data <- geog_data %>%
mutate(econ_slant = econ_prop - health_welf_prop)
# Add all constituencies as base map
# To plot the base map, we want all constituencies, not just ones where questions have been asked.
# We can get this from OSMaps
# For reproducibility, the following code downloads and processes the data programmatically
# To download manually, use https://osdatahub.os.uk/downloads/open/BoundaryLine
if (!file.exists("data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")) {
options(timeout=600) # Takes some time to download so we need to increase the "timeout" setting
dir.create("data/whole_UK_shapefile")
download.file("https://api.os.uk/downloads/v1/products/BoundaryLine/downloads?area=GB&format=ESRI%C2%AE+Shapefile&redirect", "data/whole_UK_shapefile/OS_zip.zip") #
options(timeout=60) # Reset timeout
# unzip
unzip(
"data/whole_UK_shapefile/OS_zip.zip",
files = c(
"Data/GB/westminster_const_region.dbf",
"Data/GB/westminster_const_region.prj",
"Data/GB/westminster_const_region.shp",
"Data/GB/westminster_const_region.shx"),
exdir = "data/whole_UK_shapefile",
overwrite = TRUE)
file.remove("data/whole_UK_shapefile/OS_zip.zip")
}
basemap_sf <-
st_read(
dsn = "data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")
# Make shape files for constituencies in the database
# NOTE: approach to converting from GeoJSON from ChatGPT
temp_geojson <- tempfile(fileext = ".geojson") # Create a temporary file
writeLines(geog_data$con_shapefile, con = temp_geojson) # Write out to temporary file
geog_sf <- st_read(dsn = temp_geojson) # Read the GeoJSON file into an sf object
unlink(temp_geojson) # Delete temporary file
geog_data <- cbind(geog_data, geog_sf)
geog_data <- st_as_sf(geog_data) # Convert to SF
map <-
tm_shape(basemap_sf) +
tm_sf(col = "white") +
tm_shape(geog_data) +
tm_polygons(
col = "econ_slant",
style = "cont",
midpoint = 0,
title = "Question topic slant",
palette = "RdBu"
#legend.hist = TRUE
) +
tm_legend(
legend.position = c("right", "center"),
legend.title.size = 5,
legend.text.size = 4
) +
tm_layout(
title = "Distribution of question topic slant across UK Constituencies.\n\n",
title.size = 9
)
map
lm_density <- summary(lm(econ_slant ~ density, data = analysis_df))
slope_density <- lm_density$coefficients[[2,1]]
se_density <- lm_density$coefficients[[2,2]]
plot_density <- analysis_df %>%
ggplot(aes(x=density, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
geom_text(
aes(x=15000, y= 0.5),
label = paste0(
"Slope: ", signif(slope_density, digits = 3),
"\n SE: ", signif(se_density, digits = 3)
),
size = 3)+
labs(
title = "Correlation between population density of an MP's constituency and their question slant in 2023",
x = "Population density in 2023 (people per km^2)",
y = "Slant") +
ylim(-1,1) +
theme(
panel.background = element_rect(fill = "white", color = "black"),
aspect.ratio = 1
)
plot_density
# Run linear model
lm_uc <- summary(lm(econ_slant ~ uc_claimants, data = analysis_df))
# Save slope coefficients and SEs for plotting
slope_uc <- lm_uc$coefficients[[2,1]]
se_uc <- lm_uc$coefficients[[2,2]]
# Make plots
plot_uc <- analysis_df %>%
ggplot(aes(x=uc_claimants, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
geom_text(
aes(x=5000, y= 0.8),
label = paste0(
"Slope: ", signif(slope_uc, digits = 3),
"\n SE: ", signif(se_uc, digits = 3)
),
size = 3) +
labs(
title = "Correlation between the number of households claiming Universal Credit in an MP's constituency and their question slant in 2023",
x = "Number of households claiming Universal Credit",
y = "Slant") +
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_uc
# Run linear model
lm_marg <- summary(lm(econ_slant ~ mean_marginality, data = analysis_df))
slope_marg <- lm_marg$coefficients[[2,1]]
se_marg <- lm_marg$coefficients[[2,2]]
# Make plot
plot_marginality <- analysis_df %>%
ggplot(aes(x=mean_marginality, y = econ_slant)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "lm") +
geom_text(
aes(x=-0.2, y= 0.5),
label = paste0(
"Slope: ", signif(slope_marg, digits = 3),
"\n SE: ", signif(se_marg, digits = 3)
),
size = 3) +
labs(
title = "Correlation between an MP's seat marginality and their question slant in 2023",
x = "Seat marginality",
y = "Slant") +
ylim(-1,1) +
theme(
aspect.ratio = 1,
panel.background = element_rect(fill = "white", color = "black"),
panel.grid = element_blank())
plot_marginality
# Disconnect from local database
DBI::dbDisconnect(db)
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/London
tzcode source: system (glibc)
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] gridExtra_2.3 rpart.plot_3.1.1 rpart_4.1.21 vip_0.4.1
[5] yardstick_1.2.0 workflowsets_1.0.1 workflows_1.1.3 tune_1.1.2
[9] rsample_1.2.0 recipes_1.0.9 parsnip_1.1.1 modeldata_1.2.0
[13] infer_1.0.5 dials_1.2.0 scales_1.3.0 broom_1.0.5
[17] tidymodels_1.1.1 ranger_0.16.0 ggrepel_0.9.4 sf_1.0-15
[21] tmap_3.3-4 XML_3.99-0.16 tm_0.7-11 NLP_0.2-1
[25] RSelenium_1.7.9 httr_1.4.7 RSQLite_2.3.4 DBI_1.1.3
[29] here_1.0.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[33] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[37] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8
[4] semver_0.2.0 magrittr_2.0.3 farver_2.1.1
[7] rmarkdown_2.25 vctrs_0.6.5 memoise_2.0.1
[10] base64enc_0.1-3 terra_1.7-65 htmltools_0.5.7
[13] leafsync_0.1.0 raster_3.6-26 parallelly_1.36.0
[16] KernSmooth_2.23-22 htmlwidgets_1.6.4 stars_0.6-4
[19] cachem_1.0.8 iterators_1.0.14 lifecycle_1.0.4
[22] pkgconfig_2.0.3 Matrix_1.6-3 R6_2.5.1
[25] fastmap_1.1.1 future_1.33.1 digest_0.6.33
[28] colorspace_2.1-0 furrr_0.3.1 wdman_0.2.6
[31] rprojroot_2.0.4 leafem_0.2.3 crosstalk_1.2.1
[34] labeling_0.4.3 lwgeom_0.2-13 fansi_1.0.5
[37] timechange_0.2.0 mgcv_1.9-0 abind_1.4-5
[40] compiler_4.3.2 proxy_0.4-27 bit64_4.0.5
[43] withr_2.5.2 backports_1.4.1 MASS_7.3-60
[46] lava_1.7.3 tmaptools_3.1-1 leaflet_2.2.1
[49] classInt_0.4-10 caTools_1.18.2 tools_4.3.2
[52] units_0.8-5 future.apply_1.11.1 nnet_7.3-19
[55] glue_1.6.2 nlme_3.1-163 grid_4.3.2
[58] generics_0.1.3 gtable_0.3.4 tzdb_0.4.0
[61] class_7.3-22 data.table_1.14.8 hms_1.1.3
[64] sp_2.1-2 xml2_1.3.6 utf8_1.2.4
[67] foreach_1.5.2 pillar_1.9.0 lhs_1.1.6
[70] splines_4.3.2 lattice_0.22-5 survival_3.5-7
[73] bit_4.0.5 tidyselect_1.2.0 knitr_1.45
[76] xfun_0.41 hardhat_1.3.0 timeDate_4032.109
[79] stringi_1.8.2 DiceDesign_1.10 yaml_2.3.7
[82] evaluate_0.23 codetools_0.2-19 cli_3.6.1
[85] munsell_0.5.0 dichromat_2.0-0.1 Rcpp_1.0.11
[88] globals_0.16.2 png_0.1-8 binman_0.1.3
[91] gower_1.0.1 assertthat_0.2.1 blob_1.2.4
[94] bitops_1.0-7 GPfit_1.0-8 listenv_0.9.0
[97] viridisLite_0.4.2 slam_0.1-50 ipred_0.9-14
[100] prodlim_2023.08.28 e1071_1.7-14 rlang_1.1.2